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Introduction 


The Problem 
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Introduction 


The Problem 


ə As we saw, we can figure out where the time goes 
ə And we can fix some of it 


ə But we can't get there just by hacking the kernel 
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Introduction 


Timing example to one fast network 
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Introduction 


er, what? 


Drop into device write 
Allocate 

Push into the queue 
kick the tree 


send it out 


And all we really had to do was push it into a FIFO 
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Introduction 


So we killed the generality 


But it’s still too slow 

don’t seem to be able to get under 3-4 microseconds 
fdtochan 

Validaddr 

Sort of an irreducible problem 


ə Should we optimize these? 


Ronald G. Minnich and John Floren and Jim Using Currying and process-private system ca January, 2009 6 / 30 


Introduction 


Timing example to one fast network 
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Introduction 


Should we optimize fdtochan and validaddr 


Cache the 32 or so addresses most programs use? 


Cache channels in the proc struct? 
We explored these options 


Makes code in port more complex 


End up with management of va cache in proc struct 
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Introduction 


Such micro-optimization is a common approach 


You see it all over, e.g., Linux 

Lots of heuristics in the code 

Which lead to creation of lots more timing-driven cleanup 

Gets really complex 

Can fail pathologically when programs do not follow common path 
Kernel is trying to intuit your intentions 

ə Got so crazy that Lustre just went ahead and added intents to ops 


ə l.e. when you do ‘walk’ you say ‘and | intend to open this file after 
the walk’ 


ə And all systems pay the costs of micro-optimization, even if only one 
subsystem needs it 
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Introduction 


Some numbers from Linux 


grep -r inline .—wc 

69649 440053 6904168 

grep -r likely .—wc 

18668 108045 1538433 

Note: includes likely and unlikely! 


e.g. kernel/workqueue.c: if (unlikely(cpu >= 0)) 
kernel/workqueue.c: while (unlikely(ret < 0)); 
grep -r heurist . — wc 

197 2052 20268 
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Introduction 


HPC gave up long ago 


ə Around 1986, created interfaces that let programs write direct to 
network[1] 


These were later called OS bypass 
OS did not, could not, know when or how network I/O occurred 


Ubiquotous on top supercomputers 


Problem: Any program, to get good performance, has to be rewritten 
to use HPC networks 


ə HPC network libraries starting to look like OSes 
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Introduction 


But OS bypass is a trap 


ə Sure it works 

ə Just need to write an OS in your library 

ə But most interfaces run an OS anyway 

ə Bypass host OS, but not bypass the network OS 

ə Threading in support libraries continues to be a source of bugs 
ə And these libraries have complexity usually seen only in the OS 


> page colors 
> cache alignment 
> Playing games with interrupts 
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Introduction 


Plan A and B fail 


Optimize all we want, we can not get the OS faster 
Use OS bypass, end up recreating all the problems in the libraries 


and making the HPC interface unusable for non-HPC apps 


The price is too high to pay in either case 


That is where we were last summer 


Clearly in need of Plan C 
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Breaking out of the trap 


One idea — Currying 


Useful technique 

f (x,y) = y/x 

if you know y, e.g. if y is 2 
Create g(x) s.t. g(x) = f (x, 2) 
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Breaking out of the trap 


Exploiting Currying in the kernel interface 


ə Leave information around about, e.g., past write behavior, past 
addresses 


take a chance that it will happen again 
Keep results fo validaddr, fdtochan, etc., around 


Add all the extra book-keeping needed to make it go 


Results from tomorrow’s talk show that 32 addresses and a few 
cached channels might be enough 


@ need to clean them up when process exits 
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Breaking out of the trap 


Currying alone is not enough 


We are still back to intuiting program behavior 


Kernel would have to look for patterns, cache arguments, etc. 
and it still might be wrong 


We got stuck on this problem last summer 


At some point, Currying just looks like more micro-optimizations 
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Breaking out of the trap 


Another approach: Synthesis[2| 


ə Once a function and parameters are known 
ə Generate code on the fly to implement that code 
ə Can take this pretty far 


ə In fact, to the point that repeated reading of a file turns into repeated 
reading of a disk block 


ə Not a great general approach because code synthesis can have bugs 
ə and debugging is near-impossible 


ə Perhaps new techniques such as those found in vx32 might help 
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What we did instead Five easy pieces 


Create a struct which defines a fast path 


struct Fastcall { 

int scnum; 

Chan *c; 

long (*fun)(Chan*, void*, long, vlong); 
void *buf; 


int n; 
vlong off; 
F; 
ə Holds all you need to know about a read/write system call 
@ a pointer to a function 
ə Also add an empty Fastcall * to proc struct 
e@ Used as pointer to base of dynamically allocated array 


Add counter to proc struct for size 
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What we did instead Five easy pieces 


Modify syscall code 


ə There is a path in syscall to cover "syscall number too high” 
e@ Simple change: if number too high, search array of Fastcall structs 


ə If found: up->fcli].fun(ar0, up->fcli].c, up->fcli].buf, 
up->fcli].n, up->fc[i].off); 
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What we did instead Create fast path functions 


Fast call setup function 


ə Find data chan given ctl file fd and incref it 

ə Do validaddr for (va, len) 

ə Other validation as needed for device (e.g. offset) 

@ Check for collisions in user-provided system call number 
ə Realloc up->fc struct, fill in new data 

ə Errors: read/write syscall errrors 

ə Other errors related to fastcall 
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What we did instead Create fast path functions 


Fast call function 


Passed a channel, data pointer, length, offset just as a normal call 
The overhead saved is in the generic read and write syscall 
Fastcall function can use none, some, or all of these 

E.g. on BG/P barrier, we only use the channel and the pointer 
Key is that these are validated 

void * always works because in Plan 9, segments never shrink 


Would be a nightmare in Unix 
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What we did instead User mode setup 


Special syscall code 


TEXT mysyscall1+0(SB) ,0,$8 
MOVL at+O(FP) , AX 

BYTE $0xcd; BYTE $0x40 
RET 


ə Not a lot there 


e@ There's only one argument: the system call number 
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What we did instead User mode setup 


fastpath command 


ctl = open(TEMP "/ct1", ORDWR) ; 

cmd = smprint("fastwrite 256 %d Oxip vd", fd0, data, 
sizeof (data) ); 

res = write(ctl, cmd, strlen(cmd)); 


Must have syscall number and fd 
Typically there would a pointer and size 


And maybe an offset 


On success, invoke with mysyscall1 (256) ; 
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Performance 


Blue Gene barrier network 


Normal path: too long 


More optimized path: 3.6 microseconds 


Fastcall: 770 nanoseconds 


Five-fold performance improvement 


With very little code change 
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Performance 


But it’s not a panacea 


We modified the pipe device 
Performance was no better and in some cases worse 
Why? queues are slow, slow, slow 


For slow interfaces, current interface is fine 


Shouldn't we fix those however? 
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Contributions, conclusions, future work 


Kernels have been stuck in a rut 


@ We know they’re too slow 
ə People in HPC gave up years ago, moved to OS bypass 
@ Kernel path has just gotten slower 


ə App libraries have gotten more like kernels 
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Contributions, conclusions, future work 


Key contribution: private system calls + compile-time 
fastpath + run-time currying 


Private system calls support the fastpath function 
ə But could do much more; it is up to the driver 


Compile-time fastpath eliminates Synthesis issues with runtime code 
generation 


ə Run-time currying parameterizes compile-time fastpath 


ə Result, overall, is a new way to provide a fast path to programs 


Allows us to focus optimizations where needed, ignore them where not 


i.e. not all devices need this, so no need to modify them all 
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Contributions, conclusions, future work 


Conclusions 


ə We feel this can enable much higher I/O performance 

@ Need faster and pipe device 

ə Should be able to make fossil to venti to sdC path faster 
ə Could we have a fastpath through the network stack? 
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Contributions, conclusions, future work 


Next steps 


ə Make kernel-based comms on supercomputers run as well as 
OS-bypass-based comms 


@ On non-supercomputers, provide IPC that makes user-level-servers 
run as well as in-kernel servers 


ə No fastpath for read done yet 
Do not see that as a huge problem 
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Contributions, conclusions, future work 


Further reading 


Eg 


Stephen Menke, Mark Moir, and Srikanth Ramamurthy. 
Synchronization mechanisms for scramnet+ systems. 

In PODC '98: Proceedings of the seventeenth annual ACM 
symposium on Principles of distributed computing, pages 71-80, New 
York, NY, USA, 1998. ACM Press. 
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